Client Report - Can You Predict That?

Course DS 250

Author

HENRY FELIPE

import pandas as pd 
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here

LetsPlot.setup_html(isolated_frame=True)
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here

# import your data here using pandas and the URL

Elevator pitch

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS. THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

A Client has requested this analysis, and this is your one shot at what you would say to your boss in a 2-minute elevator ride before he takes your report and hands it to the client.

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

type your results and analysis here

# Include and execute your code here


# Load data
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)

# Label target variable
df["before1980_label"] = df["before1980"].map({1: "Before 1980", 0: "1980 or newer"})

# Chart 1: Boxplot of Net Price
ggplot(df, aes(x='before1980_label', y='netprice', fill='before1980_label')) + \
    geom_boxplot() + \
    scale_y_log10() + \
    labs(
        title="Net Price vs Home Age Category",
        x="Home Built",
        y="Net Price (log scale)"
    ) + \
    theme_bw()

# Chart 2: Density Plot of Livearea (reuses df and before1980_label from Chart 1)
ggplot(df, aes(x='livearea', color='before1980_label', fill='before1980_label')) + \
    geom_density(alpha=0.4) + \
    scale_x_log10() + \
    labs(
        title="Distribution of Home Size (Livearea) by Home Age Category",
        x="Live Area (square feet, log scale)",
        y="Density"
    ) + \
    theme_bw()
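The visual comparison in the two charts can be backed by a quick numeric check of how far apart the two classes sit on each feature. The following is a minimal sketch, assuming a small made-up DataFrame named `toy` that only borrows the dataset's column names; its values are illustrative and are not taken from the real file.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's column names; values are made up.
toy = pd.DataFrame({
    "before1980": [1, 1, 1, 0, 0, 0],
    "netprice":   [120000, 150000, 135000, 260000, 310000, 280000],
    "livearea":   [1100, 1300, 1250, 2100, 2400, 1900],
})

# Median of each numeric feature within each class of the target.
medians = toy.groupby("before1980")[["netprice", "livearea"]].median()
print(medians)

# Ratio of class medians (newer / older); values far from 1 hint at a useful feature.
ratio = medians.loc[0] / medians.loc[1]
print(ratio.round(2))
```

On the real data, `df` would replace `toy`. A ratio close to 1 would suggest the feature separates the classes poorly on its own, even if it still helps in combination with others.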

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc.) and describe what other models you tried.

type your results and analysis here

# -------------------------------------------------------
# QUESTION | TASK 2: Build Classification Models
# -------------------------------------------------------

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# ----------------------------------------------
# Load Data
# ----------------------------------------------
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)

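# ----------------------------------------------
# Clean data & remove target leakage
# ----------------------------------------------
# "tasp" directly reveals sale price, so it is dropped.
# Caution: if the CSV also contains "yrbuilt", that column stays in X below.
# before1980 is derived from the build year, so it would leak the target.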
df = df.drop(columns=["tasp"], errors="ignore")

# Target variable
y = df["before1980"]

# Feature matrix
# Remove parcel ID and the target column
X = df.drop(columns=["before1980", "parcel"])

# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)

# ----------------------------------------------
# Train/Test Split
# ----------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# ----------------------------------------------
# Model 1 — Logistic Regression
# ----------------------------------------------
log_model = LogisticRegression(max_iter=2000)
log_model.fit(X_train, y_train)
log_pred = log_model.predict(X_test)
log_acc = accuracy_score(y_test, log_pred)

# ----------------------------------------------
# Model 2 — Decision Tree
# ----------------------------------------------
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
tree_acc = accuracy_score(y_test, tree_pred)

# ----------------------------------------------
# Model 3 — Random Forest (Final Choice)
# ----------------------------------------------
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

# ----------------------------------------------
# Display all accuracies
# ----------------------------------------------
log_acc, tree_acc, rf_acc
(0.9977308430790713, 1.0, 1.0)
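Perfect test accuracy from an untuned tree and forest is more often a sign of target leakage than of a strong model. One candidate, assuming the CSV contains a `yrbuilt` column, is that the build year is still among the features, and `before1980` follows directly from it. The sketch below uses synthetic data (not the real file) and a hypothetical helper, `holdout_accuracy`, to show how accuracy changes once the leaking column is removed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the dwellings data (values are made up).
toy = pd.DataFrame({
    "yrbuilt": rng.integers(1900, 2015, size=n),
    "livearea": rng.normal(1800, 500, size=n),
})
# The target is computed from yrbuilt, which is what makes yrbuilt a leak.
toy["before1980"] = (toy["yrbuilt"] < 1980).astype(int)


def holdout_accuracy(features):
    """Fit a decision tree on the given columns and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[features], toy["before1980"], test_size=0.25, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


acc_with_leak = holdout_accuracy(["yrbuilt", "livearea"])
acc_without_leak = holdout_accuracy(["livearea"])
print(acc_with_leak, acc_without_leak)
```

Applied to the real data, the equivalent change would be adding `"yrbuilt"` to the columns dropped from `X` and re-running all three models. The scores reported above would need to be regenerated after that change.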

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.

type your results and analysis here

# Include and execute your code here
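One way to build the data behind a feature importance chart is to pair a fitted forest's `feature_importances_` with the column names and sort them. This is a minimal sketch on synthetic data; the columns `signal` and `noise` are hypothetical and stand in for the real home variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500

# Synthetic features (made up): "signal" drives the label, "noise" does not.
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = (X_toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Pair each importance with its column name and sort, largest first.
importances = (
    pd.DataFrame({"feature": X_toy.columns, "importance": forest.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

With the real model, `rf.feature_importances_` and `X.columns` would take the place of the toy objects, and the resulting table could be drawn with `geom_bar(stat='identity')` from lets_plot. Impurity-based importances sum to 1, so each value reads as a share of the model's total split gain.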

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

type your results and analysis here

# Include and execute your code here
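Accuracy, precision, and recall answer different questions, which is easiest to see on a small hand-checkable example. The sketch below uses ten made-up labels (not model output from this report), where 1 means "built before 1980".

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = built before 1980, 0 = 1980 or newer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

# For binary labels sorted as [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # share of all homes labeled correctly
precision = precision_score(y_true, y_pred)  # of homes predicted "before 1980", share that truly are
recall = recall_score(y_true, y_pred)        # of homes truly "before 1980", share the model found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here the counts are tn=3, fp=2, fn=1, tp=4, so accuracy is 7/10, precision is 4/6, and recall is 4/5. For the report, `y_test` and `rf_pred` would replace the toy lists.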

STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.

type your results and analysis here

# Include and execute your code here

STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and whether this changes the model you recommend to the Client.

type your results and analysis here

# Include and execute your code here
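The join on `parcel` can be sketched with `DataFrame.merge`. The frames below are made-up stand-ins for the two files; only the `parcel` key comes from the task text, and the `nbhd_1` / `nbhd_2` columns are hypothetical names for neighborhood indicators. The sketch also assumes the neighborhood file may repeat a parcel, which is worth checking on the real data before joining.

```python
import pandas as pd

# Made-up stand-ins for the two files; only the "parcel" key is from the task text.
dwellings = pd.DataFrame({
    "parcel": ["A1", "A2", "A3"],
    "livearea": [1200, 1850, 2400],
    "before1980": [1, 0, 0],
})
neighborhoods = pd.DataFrame({
    "parcel": ["A1", "A2", "A2", "A4"],
    "nbhd_1": [1, 0, 0, 1],  # hypothetical one-hot neighborhood columns
    "nbhd_2": [0, 1, 1, 0],
})

# Drop repeated parcels first so the join cannot multiply rows.
neighborhoods = neighborhoods.drop_duplicates(subset="parcel")

# Left join keeps every dwelling; validate raises if either key is not unique.
merged = dwellings.merge(neighborhoods, on="parcel", how="left", validate="one_to_one")

# Parcels without a neighborhood match get NaN; fill with 0 for the indicator columns.
nbhd_cols = ["nbhd_1", "nbhd_2"]
merged[nbhd_cols] = merged[nbhd_cols].fillna(0).astype(int)
print(merged)
```

A left join keeps the row count of the dwellings table unchanged, so any change in model accuracy can be attributed to the added columns rather than to a different set of homes.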

STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.

type your results and analysis here

# Include and execute your code here
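Predicting the build year turns the task into regression, so error is measured in years rather than as a share of correct labels. The sketch below is one possible setup on synthetic data, using a random forest regressor as an illustrative choice. For this target, `before1980` itself would leak the answer and would need to be dropped from the real features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 500

# Synthetic data (made up): newer homes tend to be larger, plus noise.
yrbuilt = rng.integers(1900, 2015, size=n)
toy = pd.DataFrame({
    "livearea": 800 + 12 * (yrbuilt - 1900) + rng.normal(0, 150, size=n),
    "stories": rng.integers(1, 3, size=n),
})

X_tr, X_te, y_tr, y_te = train_test_split(toy, yrbuilt, test_size=0.25, random_state=42)

reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = reg.predict(X_te)

mae = mean_absolute_error(y_te, pred)            # typical miss, in years
rmse = np.sqrt(mean_squared_error(y_te, pred))   # penalizes large misses more heavily
r2 = r2_score(y_te, pred)                        # share of variance in year explained
print(f"MAE={mae:.1f} years, RMSE={rmse:.1f} years, R^2={r2:.2f}")
```

MAE reads directly as "off by about this many years on average". An RMSE well above the MAE would point to a few very large misses, and R^2 compares the model against always guessing the mean year.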